Abstract

The Flexible Image Transport System (FITS) standard has been a great boon to astronomy, allowing observatories, scientists 
and the public to exchange astronomical information easily. The FITS standard, however, is showing its age. When developing it in the
late 1970s, the FITS authors made a number of implementation choices that, while common at the time, are now seen to limit
its utility with modern data. The authors of the FITS standard could not anticipate the challenges which we are facing today in
astronomical computing. Difficulties we now face include, but are not limited to, addressing the need to handle an expanded range 
of specialized data product types (data models), being more conducive to the networked exchange and storage of data, handling 
very large datasets, and capturing significantly more complex metadata and data relationships. 

There are members of the community today who find some or all of these limitations unworkable, and have decided to move ahead 
with storing data in other formats. If this fragmentation continues, we risk abandoning the advantages of broad interoperability, and 
ready archivability, that the FITS format provides for astronomy. In this paper we detail some selected important problems which 
exist within the FITS standard today. These problems may provide insight into deeper underlying issues which reside in the format 
and we provide a discussion of some lessons learned. It is not our intention here to prescribe specific remedies to these issues;
rather, it is to call the attention of the FITS and greater astronomical computing communities to these problems in the hope that it will
spur action to address them.

Keywords: 

FITS, File formats, Standards


Preprint submitted to Astronomy & Computing


February 12, 2015 



1. Introduction 

The Flexible Image Transport System standard (FITS; Wells
and Greisen 1979; Greisen et al. 1980; Wells et al. 1981;
Greisen and Harten 1981 and Hanisch et al. 2001; and more
recently, the definition of the version 3.0 FITS standard by Pence
et al. 2010) has been a fundamental part of astronomical
computing for a significant part of the past four decades. The FITS
format became the central means to store and exchange
astronomical data, and because of hard work by the FITS community
it has become a relatively easy exercise for application writers,
archivists, and end user scientists to interchange data and work
productively on many computational astronomy problems. The
success of FITS is such that it has even spread to other domains
such as medical imaging and digitizing manuscripts in the
Vatican Library (West and Cameron, 2006; Allegrezza, 2012).

Although there have been some significant changes, the FITS
standard has evolved very slowly since its genesis in the late
1970s. New types of metadata conventions such as World
Coordinate System (WCS; Greisen and Calabretta, 2002; Calabretta
and Greisen, 2002; Greisen et al., 2006) representation and
data serializations such as variable length binary tables
(Cotton et al., 1995) have been added. Nevertheless, these changes
have not been sufficient to match the greater evolution in
astronomical research over the same period of time.

Astronomical research now goes beyond the paradigm of a
set of observational data being analyzed only by the scientific
team who proposed or collected it. The community routinely
combines original observations, theoretical calculations,
observations from others, and data from archives on the internet in
order to obtain better and wider ranging scientific results. A wide
variety of research projects now involve many diverse datasets
from a range of sources. Instruments in astronomy now
produce several orders of magnitude larger datasets than were
common at the time FITS was born, in some cases requiring
parallelized, distributed storage systems to provide adequate data
rates (Alexov et al., 2012).

Astronomers have increasingly come to rely on others to
write software programs to help process and analyze their data.
Common libraries, analysis environments, pipeline processed
data, applications and services provided by third parties form
a crucial foundation for many astronomers’ toolboxes. All of
this requires that the interchange of data between different tools
be as automated as possible, and that complex data
models and metadata used in processing are maintained and
understood through the interchange.

These changes in research practices pose new challenges for
the 21st century. We must address the need to handle an
expanded range of specialized data product types and models, be
more conducive to the distributed exchange and storage of data,
handle very large datasets and provide a means to capture
significantly more complex metadata and data relationships.

A summary of these significant problems within the FITS
standard was presented in Thomas et al. (2014). Already some
of these limitations have caused members of the community
to seek more capable storage formats, both in the past, such
as the Starlink Hierarchical Data System (HDS; Disney and
Wallace, 1982; Jenness, 2015), the eXtensible Data Format
(XDF; Shaya et al., 2001), FITSML (Thomas et al., 2001) and
HDX (Giaretta et al., 2003); and in the present and future (e.g.,
HDF5 (Anderson et al., 2011) and NDF (Jenness et al., 2015,
ascl:1411.023)). There are other popular file formats among the
radio and (sub-)millimeter astronomy community such as the
Continuum and Line Analysis Single-dish Software (CLASS)
data format associated with the Grenoble Image and Line Data
Analysis Software (GILDAS) tools (ascl:1305.010). Although
this file format does not have a public specification, there are
open-source spectroscopic software packages like PySpecKit
(ascl:1109.001) that support certain versions of the data format.
Given the large number of available storage formats, there is
certainly a possibility that the use of FITS will decline in favor of
other scientific data formats should it not adapt to these new
challenges.

The strengths of FITS are well known and include an easily
understood serialization, a plethora of stable supporting
software, good documentation of the format and the simple fact
that it remains to this day the lingua franca of astronomical
data format exchange. What we feel has been missing is an
attempt within the community to dispassionately discuss and
understand FITS in terms of problems in its application to modern
astronomical research. In this paper we hope to show that
technologies and research techniques in astronomy have evolved
but FITS has not kept pace. As a result, gaps between FITS
utility and the needs of the research community have opened
up and widened over time. It is our intended goal in this
paper to highlight some selected, important problems which exist
in the FITS core standard today. We have deliberately avoided
proposing solutions to the problems we discuss, and we remain
agnostic (because the authors are divided) on whether replacing
FITS is an obviously good or an obviously bad idea.

We present our argument in the following manner. The
various issues have been grouped under the general topics
“information interchange” (section 2), “data models”
(section 3), “metadata and data representation” (section 4) and
“large and/or distributed datasets” (section 5). We address each
of these topics in turn below then try to provide an analysis of
any deeper causes or “lessons learned” (section 6). A summary
section ends the paper and provides an overview of our work
and future direction.

2. Information Interchange 

FITS originated as a delivery format for observatory data. It
was the format of choice when transporting data between
different data reduction environments such as IRAF (ascl:9911.002),
Starlink (ascl:1110.012), AIPS (ascl:9911.003) and MIDAS
(ascl:1302.017).

In principle, FITS promotes interchange through its simple
and easily understood format which holds its information in
various levels of groupings of metadata and data blocks.
Metadata are captured via key-value pairs which are in turn grouped
into FITS headers. The first header is denoted as the ‘primary’
header and subsequent headers are known as ‘extensions’. Headers
may or may not be then grouped with data blocks. An example
primary FITS header appears in Fig. 1.
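
The key-value grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not a conforming reader: it assumes the fixed card layout of the FITS standard (80-character cards, keyword in the first 8 columns, a value indicator "= " in columns 9 and 10, and an optional comment after "/"), and the names parse_card and convert are our own, hypothetical helpers.

```python
def convert(text):
    """Convert the text of a non-string value to a Python object."""
    if text == "":
        return None
    if text == "T":
        return True
    if text == "F":
        return False
    try:
        return int(text)
    except ValueError:
        # Simplification: FITS 'D' exponents are not handled here.
        return float(text)


def parse_card(card):
    """Split one header card into (keyword, value, comment)."""
    card = card.ljust(80)
    keyword = card[:8].strip()
    if card[8:10] != "= ":
        # No value indicator: a commentary card or the END card.
        return keyword, None, card[8:].strip()
    body = card[10:]
    if body.lstrip().startswith("'"):
        # String value; doubled quotes inside strings are not handled.
        start = body.index("'")
        end = body.index("'", start + 1)
        value = body[start + 1:end].rstrip()
        _, _, comment = body[end + 1:].partition("/")
    else:
        text, _, comment = body.partition("/")
        value = convert(text.strip())
    return keyword, value, comment.strip()


cards = [
    "SIMPLE  =                    T / Standard FITS format",
    "BITPIX  =                  -32 / 32 bit IEEE floating point numbers",
    "TEMP    =                    0 / Temperature (0=cold, 1=warm)",
    "FILTNAM1= 'F555W   '           / first filter name",
    "END",
]

header = {}
for card in cards:
    keyword, value, comment = parse_card(card)
    if keyword == "END":
        break
    header[keyword] = value
```

Note that the sketch recovers values and free-text comments, but nothing in the cards themselves tells the program what a keyword such as TEMP means.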

This simple arrangement of information can satisfy many use
cases for transport; however, requirements for interchange have
evolved. Effective interchange, as we shall illustrate, now
includes things like the ability to declare models for use in higher
level processing, validation of models within the file and, at the
most basic level, the ability to declare which version of the
serialization is being used.

These capabilities have been explored and implemented in
several other data formats in astronomy. The Astronomical
Data Center (ADC) XDF format, the Low-Frequency Array for
Radio Astronomy (LOFAR) HDF5 data model (Alexov et al.,
2012), CASA measurement sets (Petry and CASA Development
Team, 2012), RPFITS from the Australia Telescope
National Facility and Starlink’s NDF (Currie, 1988; Warren-Smith
and Wallace, 1993; Economou et al., 2014) all serve as
examples in this regard.

XDF was created primarily to support archiving, web-based
use of published astronomical data and the development of
FITSML - an XML version of the FITS data model which could
use an XML schema for validation. NDF was developed in the
late 1980s as a means of organizing the hierarchical structures
that were available via the Starlink HDS format when it became
apparent that arbitrary hierarchies could lead to chaos and lack
of ability for applications to interoperate (Jenness et al., 2015).
HDX (Giaretta et al., 2003) was developed around 2002 as a
flexible way of layering high-level data structures, presented as
a virtual XML Document Object Model (DOM), atop otherwise
unstructured external data stores; this was in turn used to
develop Starlink’s NDX framework, which (among other things)
allowed FITS files to be viewed and manipulated using the
concepts of the NDF format. HDF5 (Alexov et al., 2012) was
chosen to accommodate LOFAR’s exceptionally high data rates,
6-dimensional data complexity, distributed data processing and
I/O parallelization needs.

2.1. Format versioning

There is no standard means for a FITS file to communicate
the formatting version it conforms to. Consider the example
primary header in Fig. 1; the only keyword which implies any type
of format is SIMPLE which is set to ‘T’, or true. The comment
indicates that the file conforms to “Standard FITS format”, but
what indeed is that ‘Standard’?

The designers and maintainers of FITS have espoused the
principle “once FITS, forever FITS” (see e.g., Grosbol et al.,
1988; Hanisch et al., 1993). Certainly some in the community
see this as a strength for the format as it appears to promote
long term stability and “archivability” of FITS data (Allegrezza,
2012; Library of Congress, 2012). This is not, however, quite
the same thing as saying that FITS is unversioned. There have
been at least three named descriptions of FITS. These include
the first, or ‘basic FITS’ document (Wells and Greisen, 1979;
Wells et al., 1981), the NOST version of FITS (Hanisch et al.,
2001), and the current version 3.0 (Pence et al., 2010). One
can regard these as successive improvements of a document
describing changing best practices for an unchanging format
(compare “the value [of the putative FITS version keyword] is
always 1.0 by default” in Wells (1997), which discusses this
general point in some depth). However, the fact remains that
there are features in the most recent FITS description (such as
64-bit integers, negative BITPIX values, FITS extensions and
tables) which were not present in the first FITS version, and
FITS has demonstrably evolved.

The “once FITS, forever FITS” doctrine may be taken to
require backward and forward compatibility or, if you will,
compatibility with all FITS files ever created in the case where
there is only one version ever. Either way, backward
compatibility means that it always should be feasible to use the
most recent FITS reader. For forward compatibility, at
minimum, reasonable expectation goes beyond requiring that a FITS
reader not crash when confronted with a newer FITS file; it
should do more than this. Ideally, it should parse what parts of
the file are still compliant with its understanding of the format
and report on those parts/features of the file which it does not
recognize. In either compatibility case, without unambiguous
version metadata, readers have to rely on ‘duck-typing’^ and
heuristics, which are ultimately error prone because they require
the implementer of the parser to perfectly interpret the
signature of any particular set of features present in the given FITS
instance from among other possible features which are absent.
Furthermore, as the format evolves beyond the date of its
creation, the software cannot know how that signature may change
and may incorrectly identify the version, a clear difficulty for
forward compatibility. The reliance on heuristics also has an
impact beyond writing a FITS parser. Future archivists will
certainly want to know what version of the format they are dealing
with without having to guess from ancillary evidence such as
the presence of certain keywords, date of the file creation and
so on.
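
To make the fragility of such heuristics concrete, the following sketch guesses which description of FITS a header relies on from the features listed above (64-bit integers, negative BITPIX values, extensions). The function name, the returned labels and the mapping of features to descriptions are our own illustrative assumptions, not part of any FITS library or of the standard.

```python
def guess_fits_description(header):
    """Guess, from features alone, which FITS description a header needs."""
    bitpix = header.get("BITPIX")
    if bitpix == 64:
        # Assumption: 64-bit integers imply the most recent description.
        return "version 3.0"
    if bitpix is not None and bitpix < 0:
        # Negative BITPIX values were not present in the first document.
        return "later than basic FITS"
    if "XTENSION" in header or header.get("EXTEND") is True:
        return "later than basic FITS"
    # No distinguishing feature: any description could have produced it.
    return "unknown"


fig1_like = {"SIMPLE": True, "BITPIX": -32, "NAXIS": 3, "EXTEND": True}
plain = {"SIMPLE": True, "BITPIX": 16, "NAXIS": 2}

print(guess_fits_description(fig1_like))
print(guess_fits_description(plain))
```

A header with none of the tell-tale features yields "unknown", and any feature added by a future revision would be silently misclassified; an explicit version keyword would make both cases trivial.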

The lack of versioning also limits the ability of our
community to move forward constructively with developing new
FITS versions. The “once FITS, forever FITS” doctrine
requires we accrete more and more “design rules” which may
limit our ability to implement new and needed features and
clutter reader code. Consider that three keywords have been
deprecated (BLOCKED, CROTA2 and EPOCH) by the latest version
of FITS. Per the standard, these are “obsolete structures that
should not be used in new FITS files but which shall remain
valid indefinitely”. As such, software writers must indefinitely
be on guard for these metadata and writers of new conventions
must avoid utilizing these specific keywords. As time passes
and changes of this nature accumulate, it will be progressively
harder to interpret FITS data correctly and write new
conventions.
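
The burden this places on software can be illustrated with a small sketch. Every reader must carry a check of this kind indefinitely, and the list can only grow; the function names are hypothetical, and the reserved list contains only the three keywords named above.

```python
# The three keywords named in the text as deprecated but valid indefinitely.
DEPRECATED_KEYWORDS = ("BLOCKED", "CROTA2", "EPOCH")


def deprecated_in(header):
    """Return the deprecated keywords that a reader must still handle."""
    return [key for key in DEPRECATED_KEYWORDS if key in header]


def usable_in_new_convention(keyword):
    """A new convention must avoid every keyword on the deprecated list."""
    return keyword not in DEPRECATED_KEYWORDS


header = {"SIMPLE": True, "EPOCH": 49740.82869315, "EXPTIME": 2500.0}
print(deprecated_in(header))
print(usable_in_new_convention("EPOCH"))
```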


^See http://en.wikipedia.org/wiki/Duck_typing

SIMPLE  =                    T / Standard FITS format
BITPIX  =                  -32 / 32 bit IEEE floating point numbers
NAXIS   =                    3 / Number of axes
NAXIS1  =                  800 /
NAXIS2  =                  800 /
NAXIS3  =                    4 /
EXTEND  =                    T / There may be standard extensions
ATODGAIN=             7.000000 / Analog to Digital Gain (Electrons/DN)
RNDISE  =             1.010153 / Readout Noise (DN)
EPOCH   =       49740.82869315 / exposure average time (Modified Julian Date)
EXPTIME =          2500.000000 / exposure duration (seconds)--calculated
EXPO    =          1300.000000 / weighted average initial exposure time
RSDPFILL=                 -250 / bad data fill value for calibrated images
SATURATE=                10237 / Data value at which saturation occurs
TEMP    =                    0 / Temperature (0=cold, 1=warm)
FILTNAM1= 'F555W   '           / first filter name
HSTPHOT =                    T / Preprocessed by HSTphot/mask
END


Figure 1: Representative simple primary header of a FITS file showing an assortment of FITS keywords and their associated values. This header from 1995 uses a 
definition of the, now deprecated, EPOCH keyword that is at odds with the standard usage of the period but the lack of parsable units for the field make it hard for a 
computer parser to understand this. Bytes which contain data may or may not follow the END keyword of the header. 


Although the FITS format is apparently rather simple on
disk, the multiple versions of the format description, and the
existence of numerous header conventions, mean that reading
a FITS file in full generality is a complicated and messy
business. As there is no versioning mechanism to effectively declare
deprecated structures finally “illegal”, these complications and
costs will only increase.

2.2. Declaration and validation of content meaning 

Related to, but separate from, the lack of versioning of the
serialization, is the lack of ability to declare the presence of
data models and their associated meaning. By ‘data model’ we
mean:

“a description of the objects represented by a computer
system together with their properties and relationships;
these are typically ‘real world’ objects such
as products, suppliers, customers, and orders.”^

Of course, objects in astronomy are more likely to involve
things like observations, instruments, celestial coordinates and
actual astronomical objects such as stars. Likely properties one
will encounter in a FITS file include things like observational
parameters (start/end times), astronomical coordinates, name
and properties of the observing instrumentation, and so forth.
In FITS-speak, we can say that any FITS keyword outside those
defined in the FITS standard is a data model parameter, and
collections of related FITS keywords form a data model. Ideally a
data model should be associated with a given, unique,
“namespace” so that collisions in naming of the models and requisite
parameters are avoided.


^Definition adopted from Wikipedia, see http://en.wikipedia.org/wiki/Data_model


Data models can provide a standard by which information
(data and metadata) in the file may be semantically and
syntactically validated in software. Questions such as “are all of the
required metadata/data structures present in the file?” (e.g., all
of the needed keywords occur in the correct places in the file)
and “are there any non-normative values in the file?” (all
metadata/data values are within expected bounds) are both questions
answered by syntactic validation, the conformance of
information in the file to one or more declared data models. The
question of “how do these data (inter)relate with other data” (e.g.,
can named structures in the file be associated in some manner
with others in another file/extension?) is one of semantic
validation. By confirming that the file is ‘valid’ in both senses, we
may link the data model to the information in the file, and hence
answer the fundamental question “what does this data you gave
me represent?” (e.g., lists of stars, tables of galaxies, images of
dust clouds, etc). It is important to note that all of these
questions are critical to consumers of the file.
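
The two syntactic questions posed above (are the required keywords present, and are the values within expected bounds?) can be sketched as a check of a header against a declared model. The model below, its rules and all of the names used are hypothetical and chosen purely for illustration; FITS itself provides no mechanism for declaring such a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeywordRule:
    required: bool = True
    allowed: tuple = ()  # empty tuple: any value is accepted


# Hypothetical model; the rules are invented for this example.
EXAMPLE_MODEL = {
    "TEMP": KeywordRule(allowed=(0, 1)),
    "EXPTIME": KeywordRule(),
    "FILTNAM1": KeywordRule(required=False),
}


def validate(header, model):
    """Return a list of problems found; an empty list means 'valid'."""
    problems = []
    for keyword, rule in model.items():
        if keyword not in header:
            if rule.required:
                problems.append(f"{keyword}: required keyword is missing")
            continue
        if rule.allowed and header[keyword] not in rule.allowed:
            problems.append(
                f"{keyword}: value {header[keyword]!r} is not permitted"
            )
    return problems


print(validate({"TEMP": 0, "EXPTIME": 2500.0}, EXAMPLE_MODEL))
print(validate({"TEMP": 2}, EXAMPLE_MODEL))
```

Such a check is only possible once the file declares which model it claims to follow; without that declaration the program cannot know which rule set to apply.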

There is already evidence that the FITS community values
and needs shared data models. There are many examples. WCS
and some other FITS conventions such as OIFITS (Thureau
et al., 2006), MBFITS (Muders et al., 2006), PSRFITS (Hotan
et al., 2004), SDFITS (Garwood, 2000) and FITS-IDI (Greisen,
2011) are data models. The declaration of keyword
dictionaries^ is also essentially an act of declaring one or more data
model(s).

Let us also note that it is not unreasonable to expect more
than one model to appear within a file. Consider data distributed
by the Palomar Transient Factory. For these data to permit the
widest variety of software tools to understand the astrometric
distortion in these images, keywords from both the “SIP” and
“TPV” conventions are included (Shupe et al., 2012). One
convention expresses distortion polynomials in pixel space and the
other in intermediate longitude and latitude, yet it is not
immediately obvious which data model should be applied.

^Some collected data dictionaries with FITS keywords may be seen at the
GSFC FITS site, see http://fits.gsfc.nasa.gov/fits_dictionary.html

All of these data models imply an associated “namespace”
which is a means of declaring the origin of the data model so
that we may disambiguate and/or associate declared properties
between models. For example, separate namespaces should
exist for the two aforementioned astrometric distortion models in
the example above. There are common problems which
namespaced models help to solve and even the ‘simple’ metadata in
Fig. 1 illustrates this.

Consider the TEMP keyword in the example. Without reading
the comment associated with it, we cannot know if this is a
temperature or perhaps some type of temporary file or resource
or something else. If it is a temperature then what is this the
temperature of? What do the values ‘0’ and ‘1’ mean? Are
these the only valid values for this keyword? TEMP is a likely
keyword string to appear in other files; how do we know if the
TEMP in the other files is the same one we see in the example?

Clearly, it is a non-trivial matter for the machine to
determine whether these are the same properties and to know other
important details for using this information. This problem is not
isolated to a solitary bit of rogue metadata. We can ask
similar questions about most of the keywords in the example header.
Namespaced data models help address these issues. With an
appropriate namespace mechanism in place, it is possible to create
a machine-readable mapping between the data models so that
any software program can determine whether model1:TEMP is
the same (or different) property as model2:TEMP.
Namespacing mechanisms can both provide humans with documentation,
and provide software with the means to look up model
definitions (perhaps from remote locations), and thus apply syntactic
or semantic validation rules for the information at hand. This
will allow the program to answer the remainder of our posed
questions above.
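
A machine-readable mapping of this kind might look like the following sketch. The namespaces, the third model and the property identifiers are all invented for illustration; in practice the definitions might be fetched from a remote registry rather than held in a local dictionary.

```python
# Hypothetical registry: each namespaced keyword points at the identifier
# of the property definition it refers to.
REGISTRY = {
    "model1:TEMP": "urn:example:detector-temperature-state",
    "model2:TEMP": "urn:example:temporary-file",
    "model3:DETTEMP": "urn:example:detector-temperature-state",
}


def same_property(first, second):
    """True when two namespaced keywords refer to the same definition."""
    return REGISTRY[first] == REGISTRY[second]


# Identical keyword strings, different meaning.
print(same_property("model1:TEMP", "model2:TEMP"))
# Different keyword strings, same meaning.
print(same_property("model1:TEMP", "model3:DETTEMP"))
```

The comparison is made on the declared definitions rather than on the keyword strings, which is exactly what a bare FITS keyword cannot support.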

These arguments indicate there is a pressing need for
namespaced data models, yet the only ways in which we can currently
implement them are for a human to inspect the file, or to write
special purpose software programs targeted to particular data
models. Given the data volumes that we have in astronomy, the
latter is the direction in which we should go, but it is not practical
in the general case.

The writing of generalized software programs to detect any
data models present in a given FITS file is currently a
difficult task for many reasons. First and foremost, we must
recognize that there are constantly new data models being created and
modified. Some of these are documented in a human readable
fashion but there are many more models which do not even meet
this standard. Worse, due in part to the lack of good validation
tools, the community has accepted many informal variants of
existing models. These variants may or may not be documented,
but are a result of either accidental or intentional
stretching of the original metadata usage. The header in Fig. 1, for
example, is an informal variant because of its non-standard use
of the EPOCH keyword. Finally, there is the possible
complication of more than one data model being fully, or partially,
present within a file. Without explicit signposts for the software
to use, it is likely impossible to determine which data models
are present and map information to appropriate meaning.

3. Data models 

One of the strengths of FITS is that it includes some common data
structures which are important in astronomy data. The FITS
standard includes such things as “table” and “n-dimensional
array”, the latter of which is used to model both images and data
cubes. These items are really simple representations of the data
at a primitive level, and are certainly needed for basic access
to the information within the file. Even so, these structures, by
themselves, do not contain much in the way of necessary
detail and semantic information which tells the consumer exactly
what it is they are actually consuming. For this reason, they
cannot be considered to be data models.

The FITS standard does supply a data model; for example,
the aforementioned WCS may be considered to be part of it,
and these standard semantics are generally regarded as another
strong point of FITS. The other data formats we have previously
mentioned vary in how extensive their core data model is. The
range goes from HDF5, which does not supply any data model
per se, to that of NDF which has very rich metadata in its data
model.

It is a matter of opinion as to whether more/richer detailed
data models in the format standard are better or not. The NDF
core data model metadata are certainly more detailed than the
metadata in the FITS standard. On the other hand FITS is
certainly more widely adopted than NDF. Nevertheless, we
believe FITS would benefit from an expansion in its standard data
model as there are certainly common semantics which may be
found in other data formats (e.g. NDF, XDF, etc) and
FITS-based model extensions (e.g. MBFITS or local data
dictionaries) which the community can benefit from.

In this section we detail some important missing
(component) data models.

3.1. Scientific Errors 

The measurement of physical properties with their associated 
uncertainties is fundamental to astronomical research. It is thus 
ironic that FITS, which is purposely designed for supporting 
astronomical research, has no standard data model for capturing 
information about scientific errors. 

We could easily list a great number of possible error types
which might be useful but trying to encompass all of the needs
of the community at once is likely to create an unwieldy data
model. We suggest that the community needs to provide for
the most common needs, and target that subset as a first, shared
model. Earlier efforts which might inform and help this work
include local data models at sites such as CADC (Dowler, 2012)
and the error models implemented in other data formats like
NDF (although see for example Meyerdierks, 1991), and
software efforts underway in scientific programming communities
such as Astropy (Astropy Collaboration, 2013). Each of these
has valuable insight into the requirements.

Nevertheless, we can anticipate that the following general
characteristics might be part of the model:

• Allow for both metadata and data to have errors. 

• Allow for extensible classification of the error type. For
example, “Gaussian” errors are also a subclass of
“statistical” errors.

• Allow association of more than one error class/type per 
measurement. For example, allow for both systematic and 
statistical errors to be associated with each measurement. 

• Allow for additional properties to be associated with each 
error class. For example, “statistical” errors may have an 
assigned “sigma” value. 
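
The four characteristics listed above can be sketched as a small set of data structures. This is one possible arrangement under our own assumptions, not a proposed standard: the class names, the error classification table and the example values are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical, extensible classification: each class names its parent.
ERROR_CLASSES = {
    "statistical": None,
    "systematic": None,
    "gaussian": "statistical",
}


def is_subclass(error_class, ancestor):
    """Walk up the classification to test class membership."""
    while error_class is not None:
        if error_class == ancestor:
            return True
        error_class = ERROR_CLASSES[error_class]
    return False


@dataclass
class ErrorEstimate:
    error_class: str
    size: float
    properties: dict = field(default_factory=dict)  # e.g. {"sigma": 1.0}


@dataclass
class Quantity:
    """A metadata value or a data array, with any number of errors."""
    name: str
    value: object
    errors: list = field(default_factory=list)

    def errors_of_class(self, ancestor):
        return [e for e in self.errors if is_subclass(e.error_class, ancestor)]


# A metadata value carrying both a statistical and a systematic error.
exptime = Quantity(
    "EXPTIME",
    2500.0,
    [
        ErrorEstimate("gaussian", 0.5, {"sigma": 1.0}),
        ErrorEstimate("systematic", 2.0),
    ],
)
print(len(exptime.errors_of_class("statistical")))
```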

3.2. Extended Coordinate Support 

The existing FITS WCS data models illustrate some of
the limitations associated with FITS. The “once FITS, always
FITS” idea required that the current WCS standards were
developed as an extension of the older AIPS standard, and so
inherited many of the inherent limitations of that system. Even so
they took a long time to be agreed upon. They are complex yet
incomplete and inflexible. They are inadequate for many modern
telescopes, and restrict creative use of novel coordinate
transformations in subsequent data analysis. For instance, raw data
must handle more distortion issues than the FITS WCS standard
projections can handle. There are some provisions for handling
more arbitrary distortions, but they are either cumbersome or
too simple. Perhaps the biggest limitation is that different
transformations of coordinates cannot be combined in flexible ways.
The user is effectively limited to choosing only one of the
solutions available.

This is unfortunate. Not only does it reduce the range of
transformations that can be described, but it also makes it harder
to decompose the total transformation into its component parts
thus making understanding and manipulation of the total
transformation harder. The alternative approach - a “toolkit”-style
system that creates complex transformations by stacking
simpler atomic mappings - is usually the most efficient
representation as far as data storage is concerned (for example AST, see
below).
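
The “toolkit” idea of stacking atomic mappings can be sketched as simple function composition. This is an illustration of the general technique only; it does not reproduce the AST library's interface, and the helper names and numerical values are our own.

```python
import math


def shift(dx, dy):
    """Atomic mapping: translate both coordinates."""
    return lambda x, y: (x + dx, y + dy)


def scale(factor):
    """Atomic mapping: scale both coordinates."""
    return lambda x, y: (x * factor, y * factor)


def rotate(theta):
    """Atomic mapping: rotate by theta radians about the origin."""
    c, s = math.cos(theta), math.sin(theta)
    return lambda x, y: (c * x - s * y, s * x + c * y)


def compose(*mappings):
    """Stack atomic mappings into one total transformation."""
    def total(x, y):
        for mapping in mappings:
            x, y = mapping(x, y)
        return x, y
    return total


# Pixel -> offset from a reference pixel -> scaled -> rotated.
pixel_to_sky = compose(shift(-400, -400), scale(0.1), rotate(math.pi / 2))
print(pixel_to_sky(410, 400))
```

Because the total transformation is kept as an ordered list of simple components, an additional distortion step can be inserted or removed without redefining the others.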

To illustrate the problem consider the imaging data taken by
the Hubble Space Telescope which require multiple distortion
components (see e.g., Hack et al., 2013). Some are small but
discontinuous. Others are linear but time varying. There is no
FITS WCS compatible solution that handles these needs well.
As another example, SCUBA-2 raw data (see e.g., Holland
et al., 2013) include focal plane distortions which are combined
with other transformations but must also support the dynamic
insertion of other distortion models when a Fourier transform
spectrometer (Gom and Naylor, 2010) is placed in the beam.

Another case with poor support is Integral Field Unit (IFU)
data. Many of these datasets have discontinuous WCS
models. The only way to support these in FITS now is to explicitly
map each pixel to the world coordinates. Besides being space
inefficient, this is difficult to manipulate in any simple way.

In addition to limiting the description of raw telescope data,
FITS WCS also restricts what can be done with such data
during subsequent analysis. There are many potentially interesting
transformations that would result in the final WCS being
inexpressible using the restrictive FITS model. For instance,
transforming an image of an elliptical galaxy into polar or elliptical
coordinates is currently not possible. Another unworkable case
is attaching an alternate coordinate system to an image to
represent the pixel coordinates of a second image covering the
same part of the sky. These may not be common requirements,
but they illustrate the wide range of transformation that should
be possible with a flexible WCS system.

The inflexibility in the FITS solution arises from multiple
issues, but lack of namespaces is a serious barrier to providing
a more flexible solution. If one has multiple model
components each with similar parameters, how does one distinguish
between them? One may use the letter suffix, but that is also
used to distinguish between alternate WCS models. The
limit on keyword length restricts how many
coefficients can be supported. The lack of any explicit
grouping mechanism requires complex conventions on how to relate
whole sets of keywords. With more modern structures, such
contortions and limitations are not necessary.

The reality is that to solve these problems, many software
systems have chosen alternate solutions and save their WCS
information in FITS files in other ways (or in separate files).
For example, the AST library (Warren-Smith and Berry, 1998;
Berry and Jenness, 2012) is not subject to these limitations, but
is forced to use non-standard FITS keywords when serializing
mappings to FITS files (see Fig. 2).

3.3. History and Provenance 

The FITS standard encourages people to store processing his¬ 
tory information in the header using a pseudo-comment field 
named HISTORY. This works from the perspective of making 
the information available to a sufficiently interested human (as¬ 
suming that each step in the data processing adds information 
to the end of the history section of the header) but the free-form 
nature of the entries makes it essentially impossible for a soft¬ 
ware system to understand what was done to the data. This 
may be possible within the constraints of a single data reduc¬ 
tion environment but it is highly unlikely that the content of the 
HISTORY block can be understood by any other software pack¬ 
ages. History needs to be treated as a first-class citizen with 
a standardized way of registering important information such 
as the date, the software tool and any relevant arguments or 
switches. 
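
As an illustration only, the difference between a free-form HISTORY entry and a structured record can be sketched in Python. The fields (date, tool, arguments) follow the list given above; the HistoryRecord class and the JSON layout are assumptions of this sketch and are not part of any FITS convention.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class HistoryRecord:
    """One processing step, with the fields named in the text above."""
    date: str                                       # ISO 8601 timestamp of the step
    tool: str                                       # name of the software tool that ran
    arguments: list = field(default_factory=list)   # arguments or switches used


def serialize(records):
    """Encode a processing history as machine-readable JSON text."""
    return json.dumps([asdict(r) for r in records])


def tools_used(serialized):
    """Recover the ordered list of tools without parsing free text."""
    return [entry["tool"] for entry in json.loads(serialized)]


# Hypothetical two-step reduction history (tool names are invented).
history = [
    HistoryRecord("2015-02-12T10:00:00", "flatfield", ["--master", "flat.fits"]),
    HistoryRecord("2015-02-12T10:05:00", "mosaic", ["--projection", "TAN"]),
]
encoded = serialize(history)
```

Because every record has the same named fields, a second software package can answer a question such as "which tools touched this file?" without knowing how the first package words its messages.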

[Header listing not reproduced: AST-specific cards (for example BEGAST, ENDAST, ISA, NIN and INVERT keywords), each with a comment such as “Compound Mapping”, “Number of input coordinates” or “Forward matrix value”.]

Figure 2: Example header of a representation of an AST WCS object in a FITS header when the mapping is too complex to be represented using the FITS-WCS
standard.

A related issue is data provenance; that is, sufficient records
of how files were created to permit their reproduction. For
a given processed data product it is, for example, impossible
to determine which data files contributed to the creation of
that product. While there is no metadata standard for specifying
this information in output files, experimental systems
have been developed which, when fully developed, aim to offer
programmatic interfaces that will simplify recording provenance
information. One such example is Provenance Aware
Service Oriented Architecture (PASOA; Moreau et al., 2008, 
2011), an open source architecture already used in fields such as 
aerospace engineering. In brief, when applications are executed 
they produce documentation of the process recorded in a repos¬ 
itory of provenance records that are persisted in a database. In 
astronomy, PASOA was successfully demonstrated by integrat¬ 
ing it into the Pegasus workflow management system for run¬ 
ning the Montage mosaic engine (Groth et al., 2009). 

At the JCMT Science Archive (JSA; Gaudet et al., 2008; 
Economou et al., 2015) data are created with full provenance 
information using the native provenance tracking that is part 
of NDF (Jenness et al., 2009). This provenance includes ev¬ 
ery ancestor along with history information that contributed to 
each ancestor. When these files are converted to FITS for in¬ 
gestion into the JSA using the CAOM-2 data model (Redman 
and Dowler, 2013) the provenance is trimmed to include just 
the immediate parent files (using PRVnnnnn headers) and the 
observation identifiers of the root ancestor observations (using 
OBSnnnnn headers). The full richness of the provenance in¬ 
formation is available in FITS binary tables but the lack of a 
standard leaves this information hidden from applications other 
than the ones that created it originally. 

Finally, astronomy may benefit from methodologies used to 
develop provenance systems custom to Earth Science and re¬ 
mote sensing (Tilmes and Fleig, 2008; McCann and Gomes, 
2008). 

3.4. Data Quality 

One of the more pressing needs in our era of shared and dis¬ 
tributed data is the need to know which data are “good” or, to 
put it another way, of sufficient quality. We are long past the era
when the data volume was so small that it was practical to
download all of the possible data of interest and examine them locally.


Some might insist that this is an easily solved problem. Sim¬ 
ply declare a keyword, like DQUALITY, and allow it to take a 
boolean value. To be sure, that example is an exaggeration, but 
it helps to illustrate that there is no single optimum between 
the virtue of simplicity and the vice of being simplistic. Data
quality cannot be judged on a single parameter, or even a small
set of parameters. Data which are adequate for one type of use may
be wholly inadequate in another usage context. Consider that
engineering data generally are unsuitable for science and vice 
versa. Science data may be unsuitable for other types of science 
(for example, studies of sky background vs. pointed source sci¬ 
ence). 

A data quality model, then, should be an ensemble of common
statistical measures, appropriate to the type of dataset, which may
be used to derive higher-level judgments of the quality or suitability
of the data for some declared purpose. Many higher-level data
quality models will need to be created from the lower-level
measures (image data quality, pointed catalog data quality, etc.).
From these targeted statistical measures, the dataset consumer
may judge data quality without directly examining the data themselves.

3.5. Units 

A strength of FITS is that it includes support for units within 
its core standard. There are, however, limitations in the utility 
of the provided specification. 

First, while it is syntactically flexible, there are a few
specification ambiguities which could be resolved by an explicit
grammar. This limitation has perhaps been one reason that others
have felt the need to publish more explicit prescriptions for 
units (George and Angelini, 1995). Another limitation is that 
the model does not accommodate the full range of contempo¬ 
rary astronomical data. This is evident from the adoption of 
other units systems by some major archives such as the CDS 
(Centre de Données astronomiques de Strasbourg^) which contains
a large number of published astronomical tables (see the
units section within Ochsenbein 2000). Finally, there is also a 
provenance issue to defining new units, since - to pick the ex¬ 
ample of the unit of ‘Jupiter radius’ - two different groups may 
prefer the mean or equatorial values for the radius, or in con¬ 
trast may regard it, not as simply an abbreviation for a certain 
number of kilometers, but instead as a distance whose value is 
determined at a certain atmospheric pressure level. 

The solution to these limitations is not simply to expand the 
list of recommended units since, as well as being slow, this fails 
to distinguish between, for example, different definitions of the 
second, or to communicate places where the distinction does or 
does not matter. 

The purely syntactical issues surrounding unit strings are be¬ 
ing addressed by the IVOA’s ‘VOUnits’ work (Demleitner et al., 
2014), but the higher level questions - of communicating and 
defining new units, of indicating documentation, and of con¬ 
verting between them in a scientifically meaningful manner - 
are out of scope for that work by design, since experience has 
shown them to be more contentious than one might expect. 
These questions should be taken up by the FITS community. 

Solutions then may involve cherry-picking VOUnits syntac¬ 
tical fixes and alteration of the units model to use some names¬ 
pacing mechanism which would help to disambiguate sources 
of extended units models found within a file. Finally, we be¬ 
lieve that an analysis of other more recent work, such as that 
found in Demleitner et al. and Ochsenbein can help to quickly 
round out the roster of standard units. 
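
To make the namespacing idea concrete, the following Python sketch keys each unit definition by a (namespace, unit) pair, so that two definitions of the ‘Jupiter radius’ can coexist in one file without ambiguity. The namespace names, the unit symbol Rjup and the registry itself are hypothetical; the two values are the commonly quoted equatorial and mean radii in kilometers.

```python
# Each namespace stands for a hypothetical authority publishing its own
# definitions. The numbers are commonly quoted radii of Jupiter in km.
UNIT_DEFINITIONS = {
    ("equatorial", "Rjup"): 71492.0,
    ("mean", "Rjup"): 69911.0,
}


def to_km(value, unit, namespace):
    """Convert using the definition from an explicitly named namespace."""
    try:
        return value * UNIT_DEFINITIONS[(namespace, unit)]
    except KeyError:
        raise ValueError("unit %r not defined in namespace %r" % (unit, namespace))


# The same quantity, "2 Rjup", under the two definitions.
equatorial_km = to_km(2, "Rjup", "equatorial")
mean_km = to_km(2, "Rjup", "mean")
```

The two results differ by more than three thousand kilometers, which is the kind of discrepancy that a bare unit string cannot communicate.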

4. Metadata and data representation 

What may be construed to be “FITS data” has changed sig¬ 
nificantly since the founding of FITS. The original FITS speci¬ 
fication mandated only the capture of astronomical images. Al¬ 
most a decade later, FITS extensions (Grosbol et al., 1988), 
allowing for gathering multiple related data structures in one 
FITS file, and ASCII tables (Harten et al., 1988) were intro¬ 
duced. Binary tables followed in the next decade (Cotton et al., 
1995). Changes have also occurred in metadata capture. Over 
the intervening years the FITS community has added new meta¬ 
data conventions such as HIERARCH (Wicenec et al., 2009) and 
GROUPING (Jennings et al., 2007; Jennings et al., 1995), which 
have allowed for greater flexibility in capturing metadata. This 
expansion in capability to serialize information is to be ex¬ 
pected and is all to the good. Nevertheless, as we shall illustrate 
below, the expansion is still insufficient relative to actual need. 

4.1. Rich metadata representation 

The basic element of metadata capture in a FITS file is the
FITS “card” which comprises a keyword name, a keyword
value and a comment. All FITS cards must contain the keyword
name and keyword value pair while comments in cards
are optional. Comments may sometimes contain information
about units of the metadata value.

^http://cds.u-strasbg.fr/

Metadata may comprise a rich assortment of data structures 
and single-valued metadata are only the beginning. There is a 
need to capture sets, lists, vectors and objects within metadata 
(to name only the most basic of structures). Yet FITS cards, 
without additional conventions, are only capable of capturing 
single, scalar keyword-value pairings. 

The expression of both objects and non-scalar, multi-valued 
keywords is difficult in FITS and data model designers have to 
resort to conventions to achieve this. Object storage is enabled 
in part by utilizing a hierarchical convention such as HIERARCH 
or the record-valued system proposed in the FITS distortion 
paper (Calabretta et al., in preparation). In order to hold a keyword
with either a ‘set’ or ‘list’ value, a common local convention
is to create a set of keywords sharing the same
base name followed by an integer value which may (or may not)
indicate order of the values (such as ICMB001, ICMB002, and
so on). Another example is the IRAF multispec format (see
Valdes, 1993, and references therein) which uses this scheme 
to specify related world coordinate information (see Fig. 3 for 
an example). The AST library (Warren-Smith and Berry, 1998, 
and see also §3.2) takes a similar approach in converting the 
WCS objects into FITS headers when the transformations are 
too complex to be represented by standard WCS headers (see 
Fig. 2). 
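
The indexed-keyword convention described above can be sketched in a few lines of Python. Here a plain dictionary stands in for a parsed FITS header, and the function name collect_indexed is an assumption of this sketch. Note that treating the numeric suffix as an ordering is itself part of the convention rather than something the format states.

```python
import re


def collect_indexed(header, base):
    """Gather values of keywords named <base><integer> into an ordered list.

    `header` is a plain dict of keyword -> value standing in for a parsed
    FITS header; sorting by the numeric suffix is an assumed convention.
    """
    pattern = re.compile(r"^" + re.escape(base) + r"(\d+)$")
    found = []
    for keyword, value in header.items():
        match = pattern.match(keyword)
        if match:
            found.append((int(match.group(1)), value))
    return [value for _, value in sorted(found)]


# Hypothetical header using the ICMBnnn scheme mentioned in the text.
header = {"NAXIS": 2, "ICMB002": "b.fits", "ICMB001": "a.fits", "ICMB003": "c.fits"}
components = collect_indexed(header, "ICMB")
```

Every reader that wants the list must reimplement this logic and must already know the base name, which is the cost of expressing a list through scalar cards.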

There are also restrictions on the expression of scalar values 
in headers. Consider that FITS cards are limited to 80 charac¬ 
ters and FITS keyword names may be no longer than 8 charac¬ 
ters. The result of these constraints is that keyword values may 
be no longer than 68 characters. If a value uses all of the
available space, no room is left for the comment, and a
value longer than 68 characters needs another convention
in order to capture it (such as creating a continuation line in
the header using the CONTINUE convention; HEASARC FITS
Working Group, 2007).
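
The arithmetic behind the 68 character limit is as follows: of the 80 characters in a card, 8 are taken by the keyword name and 2 by the value indicator ("= "), leaving 70; a string value also needs its two enclosing quotes, leaving 68. The Python sketch below encodes that budget. It is deliberately simplified (it does not escape embedded quotes or apply the column alignment of fixed-format cards), and the function name is hypothetical.

```python
CARD_LENGTH = 80        # total characters in one FITS card
KEYWORD_LENGTH = 8      # keyword name field
VALUE_INDICATOR = "= "  # separates keyword name from value
# Two further characters are consumed by the quotes around a string value.
MAX_STRING = CARD_LENGTH - KEYWORD_LENGTH - len(VALUE_INDICATOR) - 2


def make_string_card(keyword, value, comment=""):
    """Format a string-valued card, refusing anything that does not fit."""
    if len(keyword) > KEYWORD_LENGTH:
        raise ValueError("keyword longer than 8 characters")
    if len(value) > MAX_STRING:
        raise ValueError("string value longer than 68 characters")
    card = keyword.ljust(KEYWORD_LENGTH) + VALUE_INDICATOR + "'" + value + "'"
    if comment:
        card += " / " + comment
    if len(card) > CARD_LENGTH:
        raise ValueError("no room left for the comment")
    return card.ljust(CARD_LENGTH)
```

A value of exactly 68 characters fills the card completely, so any attempt to add a comment to it raises an error in this sketch.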

Let us now consider the impact of keyword name constraints. 
Not only are keyword names limited to a small set of charac¬ 
ters but keyword names are restricted to no more than 8 char¬ 
acters. Often these restrictions prevent clear labelling of the 
metadata element because authors are forced to map longer, 
more descriptive, names into the truncated size. Non-English 
speaking authors are additionally forced to map into the limited 
character set. If you doubt this leads to problems, try the
following experiment: open any non-trivial FITS file and scan the
header. Unless you are an expert in the data models present in
the file (and sometimes even if you are), you will find that
the cramped keyword names often lead to arcane and
confusing metadata.

WAT0_001= 'system = multispec '
WAT1_001= 'wtype=multispec label=Wavelength units=Angstroms'
WAT2_001= 'wtype=multispec spec1 = "1 113 2 4955.442888635351 0.055675655'
WAT2_002= '83 256 0. 23.22 31.27 1. 0. 2 4 1. 256. 4963.0163112090 5.676'
WAT2_003= '976664 -0.3191636898579552 -0.8169352858733255" spec2 ="2 112'
WAT2_004= '9.09" spec3 = "3 111 2 5043.5"

Figure 3: Example header from an IRAF multispec dataset indicating the use of multi-line headers that differs from the CONTINUE convention.

These restrictions on the FITS card affect conventions, which
in turn limits the utility of any implementation.
Due to the limited namespace and size of the keywords, different
conventions often reuse the same keywords for different
purposes. For example, compare the use of PV keywords
in the products of the SCAMP tools (Bertin, 2006), used for
polynomial distortion coefficients, to the more common PV keywords
used in the WCS convention for generic parameter values.
These two conventions, when used in the same file, cause
ambiguity and incorrect representation of the data. It is true that 
FITS libraries will often provide a means to choose between 
one duplicate keyword or another so that in the strictest terms 
the issue may be resolved. Nevertheless, there is no guarantee
that the author of the file intended the resolution chosen by the
library in use, nor is there any guarantee that different
libraries will resolve the matter in the same way, so the same file
may end up with a different meaning when read by different parsers.
Finally, even if libraries were consistent in behavior, the original
intention of the author is still ambiguous (for example, they
may not want to resolve in favor of one model or the other because
both models are important to keep).
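
The parser dependence described above is easy to reproduce. In the following Python sketch two hypothetical readers apply different, equally defensible policies to a header in which the keyword PV1_1 appears twice, once for each convention; the card values are invented for illustration.

```python
def parse_first_wins(cards):
    """Reader policy A: keep the first occurrence of each keyword."""
    header = {}
    for keyword, value in cards:
        header.setdefault(keyword, value)
    return header


def parse_last_wins(cards):
    """Reader policy B: keep the last occurrence of each keyword."""
    header = {}
    for keyword, value in cards:
        header[keyword] = value
    return header


# Hypothetical cards: PV1_1 written once by each of two conventions.
cards = [("PV1_1", 0.0), ("CTYPE1", "RA---TAN"), ("PV1_1", 1.0003)]
reader_a = parse_first_wins(cards)
reader_b = parse_last_wins(cards)
```

Both readers accept the file without complaint, yet they return different values for PV1_1, and neither can tell which value the author meant or whether both were meant to be kept.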

4.2. Expanded missing values support 

Missing values are a common feature of most datasets, and 
are distinct from invalid values (such as NaN or Not a Number) 
that may occur for example in floating point calculations. For 
images with integer data types, one can make use of the BLANK 
keyword to represent missing values, and for tables with inte¬ 
ger and string columns, one can make use of the TNULL header 
keyword. However, for floating point images or table columns, 
there is no mechanism for specifying missing values. This has 
led to the common use of NaN to represent missing floating 
point values. However, one should carefully distinguish be¬ 
tween true missing values (which in an image could indicate for 
example an area of sky that was not observed), versus an invalid 
value (represented by NaN) which may represent for example a 
saturated pixel; such a distinction is not currently possible in 
FITS. 
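
A small Python sketch illustrates the asymmetry. For integer data a declared sentinel identifies missing pixels; for floating point data NaN is the only marker available, so a reader cannot tell an unobserved pixel from an invalid one. The sentinel value -32768 and the pixel values are illustrative assumptions.

```python
import math

BLANK = -32768  # illustrative sentinel, as would be declared by the BLANK keyword


def count_missing_int(pixels, blank=BLANK):
    """Integer data: the declared sentinel identifies missing pixels."""
    return sum(1 for p in pixels if p == blank)


def count_nan(pixels):
    """Floating point data: NaN is the only marker available."""
    return sum(1 for p in pixels if math.isnan(p))


int_row = [12, BLANK, 40, BLANK]               # two unobserved pixels
float_row = [12.0, math.nan, 40.0, math.nan]   # one unobserved, one saturated
flagged = count_nan(float_row)
```

The floating point row reports two flagged pixels, but the information that one was never observed and the other was saturated has been lost at the point of writing.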

4.3. Data associations 

As data acquisition and data reduction systems have become 
more complex there has been a move to storing multiple image 
data components in extensions within a single FITS file. The 
FITS extension mechanism provides a scheme for having mul¬ 
tiple images but, as noted in Greisen (2003), in essentially a flat 
structure without hierarchy or inheritance. If you have nine images
in the file there is no way of indicating that three of them
are data, three are errors and three are quality masks. Indeed,
there is no way of specifying which triplets are related. You can
use the EXTNAME header to indicate relationships but this relies 
on convention and string parsing rather than being a standard 
part of the format. 
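
The fragility of the EXTNAME approach can be seen in a short Python sketch. It groups extensions using an invented ROLE_index naming scheme; the scheme, the separator and the function name are all hypothetical, which is exactly the problem, since nothing in the format obliges another writer to use them.

```python
from collections import defaultdict


def group_by_extname(extnames, separator="_"):
    """Group extensions by a '<ROLE><separator><index>' naming convention.

    The convention is hypothetical: nothing in the format guarantees that
    another writer uses the same roles, separator or numbering.
    """
    groups = defaultdict(dict)
    for position, name in enumerate(extnames, start=1):
        role, _, index = name.rpartition(separator)
        if not role or not index.isdigit():
            continue  # name does not follow the convention; relationship lost
        groups[int(index)][role] = position  # remember the extension position
    return dict(groups)


extnames = ["DATA_1", "ERROR_1", "QUALITY_1",
            "DATA_2", "ERROR_2", "QUALITY_2", "SCI"]
triplets = group_by_extname(extnames)
```

The extension named SCI is silently dropped from every group, because the only evidence of a relationship is a string that happens not to match the pattern.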

As a real world example of this problem, consider the data 
processing system for the Herschel Space Observatory which
includes context products that serve as containers for groups of
data products with each product capable of being mapped to
a FITS file stored on disk (see the Herschel architecture and
design document; Herschel Team, 2008). In particular, Herschel’s
observational data hierarchy allows all products associated
with an observation (telemetry, calibration, raw and
processed data) to be linked with the capability of lazy loading of
products from the archive “cloud”. Satisfying the requirement 
that all products are storable as FITS files has forced the links 
in these hierarchies to be specified in a very convoluted form, 
understandable only within the Herschel interactive processing 
environment (HIPE; Ott, 2010) and not by other FITS readers. 

Another approach to this situation is the conversion of NDF 
format files to FITS and back to NDF (Currie et al., 2012; Cur¬ 
rie, 1997). They demonstrated that you can represent a hier¬ 
archical data grouping in the FITS multi-extension format, but 
this is done using EXTNAME conventions combined with headers 
representing the extension level in the hierarchy and the type of 
component and so is not understood by other FITS tools. 

In both cases above, a standardized way of specifying rela¬ 
tionships between extensions would be extremely valuable to 
data and application interoperability. 

4.4. Declaring byte order 

The original FITS standard specification (Wells et al., 1981) 
requires that a series of consecutive bytes in multi-byte data 
items is stored in order of decreasing significance (known as big 
endian format). Sometimes the byte order needs to be checked 
and swapped to the opposite byte ordering (little endian format) 
in systems that do not support non-native data formats. This is 
the case in some implementations of FITS readers that do not 
use the cfitsio library (ascl:1010.001) and which use C routines
to implement other scientific capabilities. Programmers on lit¬ 
tle endian platforms who work with large data volumes may 
find that this limitation results in a performance penalty as mar¬ 
shaling data to and from the FITS big endian ordering will be 
required. This is a frequent problem for astronomical programs. 
Little endianness is found on x86 and x86-64 processors that are 
commonly used in universities and research laboratories. 
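
The marshaling step can be shown with the Python standard library. The sketch below packs 32-bit integers in the big endian order that FITS requires and compares the result with the host's native layout; the helper names are assumptions of this sketch.

```python
import struct
import sys


def to_fits_bytes(values):
    """Pack 32-bit integers most significant byte first, as FITS requires."""
    return struct.pack(">%di" % len(values), *values)


def from_fits_bytes(raw):
    """Unpack big endian 32-bit integers into native Python ints."""
    return list(struct.unpack(">%di" % (len(raw) // 4), raw))


values = [1, 256, -2]
raw = to_fits_bytes(values)
native = struct.pack("=%di" % len(values), *values)  # host byte order
needs_swap = sys.byteorder == "little"               # true on x86 and x86-64
```

On a little endian host the native bytes differ from the FITS bytes, so every value read or written passes through a swap; on a big endian host the two layouts are identical and no work is needed.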

The inability to specify the byte order will obviously result in 
a need to byte swap data. In most cases, this is not a significant 
problem or impact on performance for modern software sys¬ 
tems and can be discounted. There is however, another, more 
significant issue tied to this limitation. The ability to
wrap/translate existing data products into FITS files, without
reprocessing them to the byte order specified in the FITS standard, is
important. From the perspective of an archivist with the
responsibility of preserving the records of astronomical observations,
the less the data are altered, the more efficient and reliable the 
archival data management will be. 

4.5. Alternative encodings 

The allowed character set for metadata and data in FITS is 
overly restrictive and is limiting its application. The restrictions
on metadata and on data do not differ significantly. For metadata,
FITS only supports the 7-bit ASCII encoding for keyword
values and comments. For data encoding, authors may again 
only use 7-bit ASCII for text (string) capture in either ASCII or 
binary FITS tables although the NULL character is allowed in 
certain cases in binary tables. 

The world of astronomy has evolved beyond capturing of sci¬ 
entific information in 7-bit ASCII encoding. The FITS com¬ 
munity has grown. Many data are captured by instruments 
designed, built and run by investigators based in non-English 
speaking countries and astronomical research has grown signif¬ 
icantly elsewhere in the world. Whether as original observa¬ 
tional data products, reduced data, information from new ser¬ 
vices or in capturing theoretical data, FITS is now required 
to hold data which are not exclusively originating in English- 
speaking countries. 

The current restrictive character set is an anachronism,
particularly considering that the most common language on the
planet, Mandarin, cannot be used easily in a FITS file. Forcing
information into English can easily result in loss of valuable
meaning, unnecessarily limit the audience who may use the file,
or force the author to use some other format to store their data.

Support for alternative encodings is needed. Simple issues
which revolve around the value of keywords, like the expression
of a person’s name (with accents, for example), or the ability
to use special scientific and mathematical symbols (like the
angstrom symbol Å or the degree symbol °), should be handled.
Tabular text values should similarly be allowed alternative
encodings for the same reasons. Furthermore, while not as critical,
the format should also allow keywords themselves to be
expressed in a broader range of characters.
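
The effect of the character restriction can be checked directly. The Python sketch below tests strings against the printable 7-bit ASCII range (hexadecimal 20 to 7E) and shows the lossy substitution an author is otherwise forced into; the function names and sample strings are illustrative only.

```python
def fits_text_ok(text):
    """True if every character is printable 7-bit ASCII (0x20 to 0x7E)."""
    return all(0x20 <= ord(ch) <= 0x7E for ch in text)


def to_ascii_lossy(text):
    """Replace every character that cannot be stored with a question mark."""
    return text.encode("ascii", errors="replace").decode("ascii")


# A plain ASCII word, a name with accents, a degree symbol and a
# Mandarin word (the last is "telescope").
names = ["Angstrom", "\u00c5ngstr\u00f6m", "30\u00b0", "\u671b\u8fdc\u955c"]
accepted = [name for name in names if fits_text_ok(name)]
```

Only the first string survives the check. The others must either be transliterated by hand or degraded by substitution, and in both cases the original text cannot be recovered from the file.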

5. Large or distributed datasets 

At the time when FITS was developed, the primary media
used for archiving and transporting the data were tapes. A magnetic
tape is unlike a hard-drive in that it is a serial access device.
The concept of sequentially accessing the data was naturally
adopted for the FITS data model. Although tapes are
still widely used for archiving data, such an access mode is no
longer commonly available as the files are usually transferred
to a hard-drive before being accessed. What is more, the serial
nature of FITS has become a significant bottleneck when it
comes to working with large datasets.

Consider that many new instruments, especially in radio astronomy
(ASKAP; DeBoer et al., 2009, MWA; Tingay et al.,
2013, LOFAR; van Haarlem et al., 2013, and SKA; Cornwell
and Humphreys, 2010) have been producing, or are planning
to produce in the near future, spectral-imaging data-cubes of
unprecedented volumes, on the order of tens and hundreds of
petabytes per year. Due to the increased spatial and frequency
resolutions there are individual datasets which can now be expected
to be as large as tens of terabytes.

For many reasons which we will detail below, FITS does not
provide sufficient support for these types of large data.

5.1. Parallel write/read operations

Large datasets require parallel read/write operations to be
processed on parallel computers. FITS, however, cannot support
optimization for parallel read/write operations.

This has been the driving factor for LOFAR to invest
significant effort into development of a new format using HDF5
(Alexov et al., 2012). Most of LOFAR’s standard data products
are now stored using the HDF5 format, as are HDF5
analogs for traditional radio data structures such as visibility
data and spectral image cubes. The HDF5 libraries allow for
the construction of distributed files and impose no limits on
their sizes. The nature of the HDF5 format further provides the
ability to custom design a data encapsulation format, specifying
hierarchies, content and attributes.

5.2. Streaming imaging data 

While FITS supports cutout capabilities, serving large
datasets to an end user requires support for multiple data
representations of the same data (e.g. multiple resolutions or
fidelities) that may aid in visual exploration of multi-petabyte
imaging data. It should also be possible to stream the data
progressively to the end-user, displaying an image as soon as the
first data become available. Kitaeff et al. (2015) and Peters and
Kitaeff (2014) demonstrate the applicability and effectiveness
of such an approach on radio astronomy imagery.

5.3. Capturing indeterminately sized datasets via streaming 

Frequently there is a need to store data from an instrument or
remote site that is being transmitted over a network. It is
common that when the transfer begins the final size of the dataset
is not known. Those using FITS have handled this by writing
such data to a file without specifying the size of the last dimension
in an image or table, and when the stream is completed,
the header is appropriately updated.
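
The write-then-patch workaround can be sketched in Python as follows. This is a simplified illustration, not a valid FITS writer: it omits the mandatory keywords, the END card and the padding to 2880 byte blocks, and the helper names are assumptions. It shows only the essential trick, namely that the placeholder and final cards have the same width, so the header can be patched in place.

```python
import io

CARD = 80  # characters per header card


def card(keyword, value):
    """Fixed-width integer card (simplified; no comment field)."""
    return ("%-8s= %20d" % (keyword, value)).ljust(CARD).encode("ascii")


def stream_rows(out, rows, row_bytes):
    """Write rows of unknown total count, then patch the row count."""
    out.write(card("NAXIS1", row_bytes))
    naxis2_offset = out.tell()          # remember where the card lives
    out.write(card("NAXIS2", 0))        # placeholder until the stream ends
    count = 0
    for row in rows:                    # rows arrive one at a time
        out.write(row)
        count += 1
    end = out.tell()
    out.seek(naxis2_offset)
    out.write(card("NAXIS2", count))    # same width, so nothing shifts
    out.seek(end)
    return count


buffer = io.BytesIO()
written = stream_rows(buffer, (bytes([i]) * 4 for i in range(3)), row_bytes=4)
```

Until the final patch is applied, any reader that trusts the header sees a dataset with zero rows, which is why the file cannot safely be consumed while it is still being written.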

Nevertheless, there are applications for which one would like 
to access all the other information before the file is complete. 
This may be to integrate the data that are being read out, or to 
monitor metadata. A library supporting the data format should 
support such usage. 

5.4. Virtual and distributed datasets 

When FITS was created, the ‘file’ (bytes stored on a durable
physical medium such as spinning disk or magnetic tape) was
more or less the only way to store and transfer data. The
networked solutions which we enjoy today were absent from
the world of astronomy and storage of astronomical data in
databases was unusual. Code was run locally by experts and
the results, if shared at all, were usually only reported in published
papers. In the intervening years, computer and information
technologies have evolved and broadened; we now enjoy
many new means of accessing, providing and storing data.
FITS should join this revolution.

We should start to think of FITS as a ‘container’
of astronomical information which is not necessarily a file. Is
there any reason to prevent our FITS ‘file’ from overlaying a
portion of a database? Why not allow FITS to be a wrapper
around bytes held within a distributed mass store such as iRODS^
(see e.g., Rajasekar et al., 2007) or a cloud? Similarly, we
would want FITS to contain, and adequately access, data generated
by a service (simulation data, for example). More speculatively,
FITS could itself execute simple stored algorithms to
generate a portion of its data.^

These use cases are only examples of where we might go 
and some may be arguably of limited value. Nevertheless, the 
generalized use case that may be derived from all of these is 
certainly of importance: a science data storage format should 
be able to support both local and remote data access, providing 
immediate and secure access to data contained within itself, and 
providing transparent access to data held in non-local entities 
such as cloud storage, databases, services, and other files.

6. Discussion - Lessons Learned 

What then are the lessons we might draw from the above 
analysis of FITS? Are there any deeper issues and commonal¬ 
ities which thread throughout these issues? In fact, there are 
several. 

6.1. Lesson 1. The format should be versioned 

Contrary to the conclusion drawn in Wells (1997), the first lesson
that we may draw is that the format needs to be versioned.
As we have argued in this paper, we disagree with the premise
that FITS has never undergone significant change and hence
that there is only one version. Without versioning, it becomes a
significantly harder task to write parsers for the format, requiring
the software developer to encompass as many design rules
as possible in order to robustly handle format instances. As
the format evolves and adds new design rules, it only becomes
more difficult to write the next parser and, just as bad, older
parsers of the format may fail quietly. Ultimately, this experience
is contrary to the espoused goal of archivability, as one
can never know for certain which permutation of the
format the FITS file being read conforms to.

In contrast, when versioning is present, implementers of 
parsers are able to target a subset of the format design, and de¬ 
clare that within the software so that, should it inadvertently be 
used on a version it does not understand, it may fail gracefully 
and in a planned manner. 
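
A planned failure of this kind takes only a few lines once a version is declared. In the Python sketch below the keyword name FMTVER, the version strings and the exception class are all hypothetical; FITS defines no such keyword, which is the point being made.

```python
SUPPORTED_VERSIONS = {"1.0", "1.1"}  # versions this hypothetical parser targets


class UnsupportedVersion(Exception):
    """Raised when an instance declares a version the parser does not know."""


def read_instance(header):
    """Refuse early and explicitly rather than misread the data quietly."""
    version = header.get("FMTVER")   # hypothetical mandatory version keyword
    if version is None:
        raise UnsupportedVersion("no format version declared")
    if version not in SUPPORTED_VERSIONS:
        raise UnsupportedVersion("format version %s not supported" % version)
    return {"version": version, "keywords": sorted(header)}
```

A parser written this way never guesses: it either reads an instance it was designed for, or it stops with a message that tells the user precisely why.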


^http://irods.org

^This probably implies the need for a “FITS language” to generate these
data.


Because it is so important to understanding the design of the 
format, versioning metadata should be part of the standard. The 
choice to implement this as an optional add-on data model (such 
as a FITS convention) is to be avoided. This is because, with¬ 
out the enforcement of being part of the standard, versioning 
is unlikely to be implemented where it is needed most, in the 
generation of new instances. 


6.2. Lesson 2. The format should be self-describing 

The next lesson that we may draw is that the format needs 
to be “self-describing” in a machine-readable manner. We con¬ 
sider a self-describing format to be one where the formatted 
instance is capable of conveying and validating the semantic 
information it holds, where the formatted “instance” may be a
file, or a collection of related files, or perhaps something more
exotic (see section 5.4 above).

As we have already seen, FITS lacks semantic validation and
its syntactic validation is very limited, achieved only by the creation
of hard-coded rules in software utilities such as fitsverify
(part of the ftools package; ascl:9912.002). Furthermore, the
limitation on keyword length to 8 characters all but guarantees
that semantic information within the header is obfuscated. As
we have shown, this in part contributes to the problem of being
able to detect, and implement, multiple data models within a
single FITS file and can lead to the inadvertent creation of informal
(and undetectable) variants. Without this
validation, archiving and interchange of information in the format
suffer. It is harder to build robust software systems as any
components involved in the interchange of FITS are unable to
adequately detect, and handle, invalid files fed to them.

The declaration of validation rules should be flexible. In
FITS, where syntactic rules are hard-coded, it is not possible
to declare syntactic rules which check the range or data type of
metadata fields^ without re-coding the utility. Ideally, the data
format should not rely on hard-coding these rules in software.
Rather, a means to capture and associate the data model/namespace
information with the contents of the formatted instance, in
a machine-readable manner, should be found so that validation
is possible without human inspection or specifically written
software programs (similar, perhaps, to the way that JSON
or XML formats have schemata).
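
The difference between hard-coded and declared rules can be sketched in Python. Here the rules are plain data that could, in principle, travel with the formatted instance or be referenced by it, while the validator is generic and never needs re-coding. The keyword names, limits and rule layout are hypothetical and do not correspond to any existing FITS mechanism.

```python
# Rules are data, not code. Keyword names and limits here are hypothetical.
RULES = {
    "EXPTIME": {"type": float, "min": 0.0},
    "FILTER": {"type": str},
    "NCOMBINE": {"type": int, "min": 1, "max": 9999},
}


def validate(header, rules):
    """Return a list of human-readable violations (empty if valid)."""
    problems = []
    for keyword, rule in rules.items():
        if keyword not in header:
            problems.append("%s: missing" % keyword)
            continue
        value = header[keyword]
        if not isinstance(value, rule["type"]):
            problems.append("%s: expected %s" % (keyword, rule["type"].__name__))
            continue
        if "min" in rule and value < rule["min"]:
            problems.append("%s: below minimum" % keyword)
        if "max" in rule and value > rule["max"]:
            problems.append("%s: above maximum" % keyword)
    return problems
```

Adding a range check for a new keyword means adding one entry to the rule set; the validating software itself is unchanged, and the author of the data model, rather than a downstream programmer, decides what counts as valid.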

This approach has additional benefits downstream. First, it
will help to avoid misinterpretation because it is better for the 
creator of the data and/or data model to provide the machine- 
readable information rather than a downstream programmer. 
Second, there is a saving in effort in that the model is done once 
and need not be repeated by numerous downstream program¬ 
mers. Finally, good validation tools will allow the community 
to better detect informal variant models and reject them, pro¬ 
moting good practice. 


^Beyond the few canonical keywords which are part of the FITS standard
such as GCOUNT or PCOUNT.

6.3. Lesson 3. The format should not limit expression of desired
data models

FITS was originally designed around a data model which 
contained a single, generic, two dimensional image and an asso¬ 
ciated header for metadata. This basic data model has been ex¬ 
panded to allow more instances (extensions), as well as types of 
data (data cubes and tables). As we discussed in prior sections, 
working with the basic data model of FITS, authors have imple¬ 
mented their own data models with ever greater demand being 
placed on the type of data (models) FITS may hold. Holding 
the line on format changes via the “once FITS, forever FITS” 
doctrine has been harmful. Original format design decisions 
have largely been held onto, and have limited the expression of 
new user data models. 

There are two general classes of problem which have held
back realizing many needed models. One class concerns the
limits created by the format of the serialization itself.
Specifically, we mean the limits on metadata representation
enumerated in section 4.1 and character encoding in section 4.5. This
class causes difficulties in realizing the WCS model, for
example.

The other class of problem is that some needed machinery for
data modeling within the FITS standard itself is missing.
Beyond the aforementioned need to declare models in a
machine-readable manner (detailed in Lesson 2), this class encompasses
a broader range of issues which include the missing ability to
declare byte order, no standard means to make associations
between data and metadata, and an inability to create data models
which extend both metadata and the data. We have already
discussed the first two issues in sections 4.4 and 4.3
respectively. The last issue is related to the fact that there is no
provision in the standard for extending the existing capture of data
itself. For example, if one creates a convention for a new image
type which supports multiple representations of the data
(section 5.2) it is no longer readable by any FITS parser. Other
limitations which arise and/or are unsolvable as a result of this
class of problem include no support for optimization of parallel
I/O (section 5.1), streaming issues (sections 5.2 and 5.3), lack
of distributed/virtual data representation and representation of
information held in “non-file” instances such as in a database
(section 5.4).

It is critical that this data modeling machinery be integral to
the format. As we have already noted, in many of the above
cases it is possible to create a solution using one or more
conventions, but doing so will result in files which are unparsable
by downstream readers which do not implement these
conventions. In addition, there is no apparent solution, using a FITS
convention, that can solve the limitation on the restricted
expression of keyword names, nor can one utilize conventions to
describe how to serialize data.

The data format should do as little as possible to impede the
expression of data models. In practice this means having very few
hard-coded rules within the format itself such as the 2880 byte record
or 80 byte card. Schemata should be capable of describing the
layout of data and metadata and the format serialization should
be flexible enough to handle desired data models and
associations between them.
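
The effect of these hard-coded rules can be seen in a short sketch of how a header is laid out. The function names make_int_card and pad_header are hypothetical and the code handles only fixed-format integer values; it is meant to show where the fixed sizes bite, not to serve as a FITS writer.

```python
CARD_LEN = 80      # bytes per header card
BLOCK_LEN = 2880   # bytes per FITS block, i.e. 36 cards
KEYWORD_LEN = 8    # bytes reserved for the keyword name


def make_int_card(keyword, value):
    """Build one fixed-format integer card: keyword in bytes 1-8,
    '= ' in bytes 9-10, value right-justified to end in byte 30,
    then space padding up to byte 80."""
    if len(keyword) > KEYWORD_LEN:
        # The layout leaves no room for a longer, more descriptive name.
        raise ValueError(
            f"keyword {keyword!r} exceeds {KEYWORD_LEN} characters"
        )
    card = f"{keyword:<8}= {value:>20d}"
    return card.ljust(CARD_LEN)


def pad_header(cards):
    """Join cards, append the END card, and pad with spaces to a
    multiple of 2880 bytes."""
    text = "".join(cards) + "END".ljust(CARD_LEN)
    remainder = len(text) % BLOCK_LEN
    if remainder:
        text += " " * (BLOCK_LEN - remainder)
    return text.encode("ascii")
```

A two-keyword header still occupies a full 2880 byte block, and a descriptive name such as EXPOSURETIME is rejected outright because it does not fit in the eight bytes reserved for the keyword.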


6.4. Lesson 4. Conventions are not standards 

As currently envisaged, conventions have no path of
migration to become part of the FITS standard. There are many
useful conventions that provide features that many people use, but
no one can guarantee that a particular convention will be
supported by a FITS reader. A FITS library cannot simply state
that a particular version of the standard is supported but must
also state all the conventions that are supported. Multiple
conventions exist for continuing long header lines (multi-spec and
CONTINUE) and for supporting hierarchical headers (HIERARCH
and record-valued) but the standard does not have anything to
say as to which convention is preferred. Tile compression is
rightfully thought of as a success for FITS, but again tile
compression (e.g., Seaman et al., 2007; Pence et al., 2009) is a
convention and not a standard, with no guarantee that a particular
reader will be able to understand the compression scheme.
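
The practical consequence can be sketched with the CONTINUE long-string convention, in which a string value ending in '&' is continued by the CONTINUE cards that follow it. In the sketch below, cards are simplified to already-parsed (keyword, value) pairs, which is an assumption made to keep the example short; the function names read_naive and read_with_continue are hypothetical.

```python
def read_naive(cards, name):
    """A reader unaware of the convention: it returns only the value
    on the first matching card."""
    for keyword, value in cards:
        if keyword == name:
            return value
    return None


def read_with_continue(cards, name):
    """A reader that implements the convention: a value ending in '&'
    is joined with the CONTINUE cards immediately following it."""
    for i, (keyword, value) in enumerate(cards):
        if keyword != name:
            continue
        parts = [value]
        j = i + 1
        while (parts[-1].endswith("&") and j < len(cards)
               and cards[j][0] == "CONTINUE"):
            parts[-1] = parts[-1][:-1]  # drop the continuation marker
            parts.append(cards[j][1])
            j += 1
        return "".join(parts)
    return None
```

Given the same cards, the two readers return different values, and neither the file nor the standard tells the user which of the two to expect.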

CFITSIO, because it is so heavily relied upon by the community
to implement FITS reader software, may be considered to be
the de facto implementation of FITS. This fact, along with the
lack of a migration path for conventions, effectively means it is
also acting as a de facto standards body. CFITSIO supports some
conventions but not others. Effectively, the conventions it
supports have become mainstream while others have not. This is
the wrong process for making worthy conventions widespread.
We feel that it is important that the lessons learned from
implementing these conventions provide feedback to the standards
process to allow the standard to continue to grow and evolve
over time.

7. Summary 

The limitations which we have described in this paper are
significant and we have tried to provide an analysis of their deeper
origin. From our investigation, it is clear that FITS suffers from
a lack of sufficient evolution. Original design decisions, such
as the header byte layout and fixed character encoding, made a
certain sense at the time FITS was founded. The later
enshrinement of the FITS “Once FITS, always FITS” doctrine, which
has been utilized to effectively freeze the format, was a
mistake in our opinion. Adherence to the doctrine, and lack of any
means to version the format in a machine-readable manner, has
stifled necessary change of FITS.

More positively, the limitations identified in FITS provide
an opportunity to draw a number of important lessons
from the FITS experience. Furthermore, we can use our
analysis to identify root causes and turn these into requirements
which might be used as goals for future work. For example,
some possible requirements might include:

• The data format shall be versioned. 

• The data format shall allow for syntactic and semantic
validation.

• The data format standard shall contain provision for declaration
of common advanced data structures. These structures
include non-scalar values, sets, objects and associations
between other metadata and data.

• The data format shall allow declaration of the character set
(in both metadata and data).

• The data format shall allow user models to declare new
data structures in data.

We should not neglect FITS's strengths either. In particular,
FITS's inclusion of semantic content (data models) as part of the
standard should continue to be pushed. There are many critical
missing data models (section 3) that, if included in the standard,
would provide compelling reasons to use a data format.

To be clear, we do not wish to recommend specific corrective
action to any particular problem or derived requirement. Instead,
we hope that action will flow from constructive community
discussion including the analysis of other astronomical data
formats and any lessons learned from their use and construction.
We may anticipate that the form of the possible resolutions to
problems in FITS may involve moving existing FITS
conventions into the core standard, modification of the FITS standard
to remove limitations, or even transferring the FITS data model
over into a new serialization.

Our effort will also continue. We plan to extend the work
started here in a future paper in which we will gather use
cases or “lessons learned” which also show FITS's strengths, and
glean the same from other data formats. From these we plan
to extract and publish an overview of the requirements for a
modern astronomical data format.